9 research outputs found

Controllable Text Summarization: Unraveling Challenges, Approaches, and Prospects -- A Survey

Author: Mishra Pruthwik
Mishra Rahul
Roy Tathagato
Urlana Ashok
Publication date: 15/11/2023

Generic text summarization approaches often fail to address the specific intent and needs of individual users. Recently, scholarly attention has turned to the development of summarization methods that are more closely tailored and controlled to align with specific objectives and user needs. While a growing corpus of research is devoted to more controllable summarization, there is no comprehensive survey available that thoroughly explores the diverse controllable aspects or attributes employed in this context, delves into the associated challenges, and investigates the existing solutions. In this survey, we formalize the Controllable Text Summarization (CTS) task, categorize controllable aspects according to their shared characteristics and objectives, and present a thorough examination of existing methods and datasets within each category. Moreover, based on our findings, we uncover limitations and research gaps, while also delving into potential solutions and future directions for CTS. Comment: 19 pages, 1 figure

arXiv.org e-Print Archive

Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages

Cross-lingual dubbing of lecture videos requires the transcription of the original audio, correction and removal of disfluencies, domain term discovery, text-to-text translation into the target language, chunking of text using target language rhythm, and text-to-speech synthesis followed by isochronous lipsyncing to the original video. This task becomes challenging when the source and target languages belong to different language families, resulting in differences in generated audio duration. This is further compounded by the original speaker's rhythm, especially for extempore speech. This paper describes the challenges in regenerating English lecture videos in Indian languages semi-automatically. A prototype is developed for dubbing lectures into 9 Indian languages. A mean-opinion-score (MOS) is obtained for two languages, Hindi and Tamil, on two different courses. The output video is compared with the original video in terms of MOS (1-5) and lip synchronisation with scores of 4.09 and 3.74, respectively. Human effort is also reduced by 75%.

arXiv.org e-Print Archive

GermEval 2018: Machine Learning and Neural Network Approaches for Offensive Language Identification

Author: Lanka Soujanya
Mishra Pruthwik
Mujadia Vandan
Publication venue: oeaw
Publication date: 02/10/2018

Social media has been an effective carrier of information from the day of its inception. People worldwide are able to interact and communicate freely without much of a hassle due to the wide reach of social media. Though the advantages of this mode of communication are many, the severe drawbacks cannot be ignored. One such instance is the rampant use of offensive language in the form of hurtful, derogatory or obscene comments. There is a greater need to employ checks on social media websites to curb the menace of offensive language. GermEval Task 2018 is an initiative in this direction to automatically identify offensive language in German Twitter posts. In this paper, we describe our approaches for different subtasks in the GermEval Task 2018. Two different kinds of approaches, machine learning and neural network approaches, were explored for these subtasks. We observed that character n-grams in Support Vector Machine (SVM) approaches outperformed their neural network counterparts most of the time. The machine learning approaches used TF-IDF features for character n-grams and the neural networks made use of word embeddings. We submitted the outputs of three runs, all using SVM: one run for Task 1 and two for Task 2.

Elektronisches Publikationsportal der Österreichischen Akademie der Wissenschaften

Children learn ergative case marking in Hindi using statistical preemption and clause-level semantics (intentionality): evidence from acceptability judgment and elicited production studies with children and adults

Author: Ambridge Ben
Bhaya Nair Rukmini
Maitreyee Ramya
Mishra Pruthwik
Misra Sharma Dipti
Narasimhan Bhuvana
Samanta Soumitra
Saxena Gaurav
Publication date: 01/01/2023

Background: A question that lies at the very heart of language acquisition research is how children learn semi-regular systems with exceptions (e.g., the English plural rule that yields cats, dogs, etc., with exceptions feet and men). We investigated this question for Hindi ergative ne marking, another semi-regular but exception-filled system. Generally, in the past tense, the subject of two-participant transitive verbs (e.g., Ram broke the cup) is marked with ne, but there are exceptions. How, then, do children learn when ne marking is required, when it is optional, and when it is ungrammatical? Methods: We conducted two studies using (a) acceptability judgment and (b) elicited production methods with children (aged 4-5, 5-6 and 9-10 years) and adults. Results: All age groups showed effects of statistical preemption: the greater the frequency with which a particular verb appears with versus without ne marking on the subject – relative to other verbs – the greater the extent to which participants (a) accepted and (b) produced ne over zero-marked subjects. Both children and adults also showed effects of clause-level semantics, showing greater acceptance of ne over zero-marked subjects for intentional than unintentional actions. Some evidence of semantic effects at the level of the verb was observed in the elicited production task for children and the judgment task for adults. Conclusions: Participants mainly learn ergative marking on an input-based verb-by-verb basis (i.e., via statistical preemption; verb-level semantics), but are also sensitive to clause-level semantic considerations (i.e., the intentionality of the action). These findings add to a growing body of work which suggests that children learn semi-regular, exception-filled systems using both statistics and semantics.

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

The University of Manchester - Institutional Repository

Children learn ergative case marking in Hindi using statistical preemption and clause-level semantics (intentionality): evidence from acceptability judgment and elicited production studies with children and adults [version 2; peer review: 1 approved, 2 approved with reservations]

Author: Ben Ambridge
Bhuvana Narasimhan
Dipti Misra Sharma
Gaurav Saxena
Pruthwik Mishra
Ramya Maitreyee
Rukmini Bhaya Nair
Soumitra Samanta
Publication venue: F1000 Research Ltd
Publication date: 01/09/2023

Directory of Open Access Journals

The University of Manchester - Institutional Repository

Explore Bristol Research

Universal Segmentations 1.0 (UniSegments 1.0)

Author: Angle Sachi
Ansari Ebrahim
Arkhangelskiy Timofey
Bafna Nyati
Batsuren Khuyagbaatar
Bella Gábor
Bertinetto Pier Marco
Bodnár Jan
Bonami Olivier
Celata Chiara
Daniel Michael
Fedorenko Alexei
Filko Matea
Giunchiglia Fausto
Haghdoost Hamid
Hathout Nabil
Khomchenkova Irina
Khurshudyan Victoria
Kyjánek Lukáš
Levonian Dmitri
Litta Eleonora
Medvedeva Maria
Muralikrishna S. N.
Namer Fiammetta
Nikravesh Mahshid
Padó Sebastian
Passarotti Marco
Plungian Vladimir
Polyakov Alexey
Potapov Mihail
Mishra Pruthwik
Rao B Ashwath
Rubakov Sergei
Husain Samar
Sharma Dipti Misra
Svoboda Emil
Talamo Luigi
Tribout Delphine
Vidra Jonáš
Vodolazsky Daniil
Vydrin Arseniy
Zakirova Aigul
Zeller Britta
Ševčíková Magda
Šnajder Jan
Šojat Krešimir
Štefanec Vanja
Žabokrtský Zdeněk
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 17/01/2022

Universal Segmentations (UniSegments) is a collection of lexical resources capturing morphological segmentations harmonised into a cross-linguistically consistent annotation scheme for many languages. The annotation scheme consists of simple tab-separated columns that store a word and its morphological segmentations, including pieces of information about the word and the segmented units, e.g., part-of-speech categories, type of morphs/morphemes, etc. The current public version of the collection contains 38 harmonised segmentation datasets covering 30 different languages.

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University